强化学习中的信用作业是衡量行动对未来奖励的影响的问题。特别是,这需要从运气中分离技能,即解除外部因素和随后的行动对奖励行动的影响。为实现这一目标,我们将来自因果关系的反事件的概念调整为无模型RL设置。关键思想是通过学习从轨迹中提取相关信息来应对未来事件的价值函数。我们制定了一系列政策梯度算法,这些算法使用这些未来条件的价值函数作为基准或批评,并表明它们是可怕的差异。为避免对未来信息的调理潜在偏见,我们将后视信息限制为不包含有关代理程序行为的信息。我们展示了我们对许多说明性和具有挑战性问题的算法的功效和有效性。
translated by 谷歌翻译
我们介绍并分析新的一阶优化算法系列,它概括并统一镜像血统和双平均。在该系列的框架内,我们定义了用于约束优化的新算法,这些算法结合了镜像血统和双平均的优点。我们的初步仿真研究表明,这些新算法在某些情况下显着优于可用方法。
translated by 谷歌翻译
Non-linear state-space models, also known as general hidden Markov models, are ubiquitous in statistical machine learning, being the most classical generative models for serial data and sequences in general. The particle-based, rapid incremental smoother PaRIS is a sequential Monte Carlo (SMC) technique allowing for efficient online approximation of expectations of additive functionals under the smoothing distribution in these models. Such expectations appear naturally in several learning contexts, such as likelihood estimation (MLE) and Markov score climbing (MSC). PARIS has linear computational complexity, limited memory requirements and comes with non-asymptotic bounds, convergence results and stability guarantees. Still, being based on self-normalised importance sampling, the PaRIS estimator is biased. Our first contribution is to design a novel additive smoothing algorithm, the Parisian particle Gibbs PPG sampler, which can be viewed as a PaRIS algorithm driven by conditional SMC moves, resulting in bias-reduced estimates of the targeted quantities. We substantiate the PPG algorithm with theoretical results, including new bounds on bias and variance as well as deviation inequalities. Our second contribution is to apply PPG in a learning framework, covering MLE and MSC as special examples. In this context, we establish, under standard assumptions, non-asymptotic bounds highlighting the value of bias reduction and the implicit Rao--Blackwellization of PPG. These are the first non-asymptotic results of this kind in this setting. We illustrate our theoretical results with numerical experiments supporting our claims.
translated by 谷歌翻译
This paper introduces a novel algorithm, the Perturbed Proximal Preconditioned SPIDER algorithm (3P-SPIDER), designed to solve finite sum non-convex composite optimization. It is a stochastic Variable Metric Forward-Backward algorithm, which allows approximate preconditioned forward operator and uses a variable metric proximity operator as the backward operator; it also proposes a mini-batch strategy with variance reduction to address the finite sum setting. We show that 3P-SPIDER extends some Stochastic preconditioned Gradient Descent-based algorithms and some Incremental Expectation Maximization algorithms to composite optimization and to the case the forward operator can not be computed in closed form. We also provide an explicit control of convergence in expectation of 3P-SPIDER, and study its complexity in order to satisfy the epsilon-approximate stationary condition. Our results are the first to combine the composite non-convex optimization setting, a variance reduction technique to tackle the finite sum setting by using a minibatch strategy and, to allow deterministic or random approximations of the preconditioned forward operator. Finally, through an application to inference in a logistic regression model with random effects, we numerically compare 3P-SPIDER to other stochastic forward-backward algorithms and discuss the role of some design parameters of 3P-SPIDER.
translated by 谷歌翻译
Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization. In addition, our tokenizer is highly explainable since it produces an explicit segmentation of sequences into blocks. We evaluate our pre-trained model on several English datasets from different domains as well as on synthetic noise. We find that MANTa improves robustness to character perturbations and out-of-domain data. We then show that MANTa performs comparably to other models on the general-domain GLUE benchmark. Finally, we show that it is considerably faster than strictly byte-level models.
translated by 谷歌翻译
In this paper, we develop a new algorithm, Annealed Skewed SGD - AskewSGD - for training deep neural networks (DNNs) with quantized weights. First, we formulate the training of quantized neural networks (QNNs) as a smoothed sequence of interval-constrained optimization problems. Then, we propose a new first-order stochastic method, AskewSGD, to solve each constrained optimization subproblem. Unlike algorithms with active sets and feasible directions, AskewSGD avoids projections or optimization under the entire feasible set and allows iterates that are infeasible. The numerical complexity of AskewSGD is comparable to existing approaches for training QNNs, such as the straight-through gradient estimator used in BinaryConnect, or other state of the art methods (ProxQuant, LUQ). We establish convergence guarantees for AskewSGD (under general assumptions for the objective function). Experimental results show that the AskewSGD algorithm performs better than or on par with state of the art methods in classical benchmarks.
translated by 谷歌翻译
我们考虑在以$ s $状态的地平线$ h $和$ a $ ACTIVE的偶发性,有限的,依赖于阶段的马尔可夫决策过程的环境中进行强化学习。代理商的性能是在与环境互动以$ t $插件互动后的遗憾来衡量的。我们提出了一种乐观的后验抽样算法(OPSRL),这是一种简单的后验抽样变体,仅需要许多后样品对数,$ h $,$ s $,$ a $和$ t $ a $ h $ s $ s $ a $ a $和$ t $一对。对于OPSRL,我们保证最多可容纳订单的高概率遗憾,$ \ wideTilde {\ mathcal {o}}}(\ sqrt {h^3sat})$忽略$ \ text {poly} \ log(hsat)$项。新型的新型技术成分是线性形式的新型抗浓缩不等式,可能具有独立感兴趣。具体而言,我们将Alfers and Dinges [1984]的Beta分布的基于正常近似的下限扩展到Dirichlet分布。我们的界限匹配订单$ \ omega(\ sqrt {h^3sat})$的下限,从而回答了Agrawal和Jia [2017b]在情节环境中提出的空旷问题。
translated by 谷歌翻译
重要性采样(IS)是一种使用来自建议分布和相关重要性权重的独立样本在目标分布下近似期望的方法。在许多应用中,只有直到归一化常数才知道目标分布,在这种情况下,可以使用自称为(SNIS)。虽然自我正态化的使用可能会对估计量的分散产生积极影响,但它引入了偏见。在这项工作中,我们提出了一种新方法BR-SNIS,其复杂性与SNI的复杂性基本相同,并且显着降低了偏见而不增加差异。这种方法是一种包装器,从某种意义上说,它使用了与SNIS相同的建议样本和重要性权重,但巧妙地使用了迭代采样(ISIR)重新采样(ISIR)来形成估算器的偏置版本。我们为提出的算法提供了严格的理论结果,包括新的偏见,方差和高概率界限,这些算法由数值示例进行了说明。
translated by 谷歌翻译
本文提供了具有固定步骤大小的线性随机近似(LSA)算法的有限时间分析,这是统计和机器学习中的核心方法。 LSA用于计算$ d $ - 二维线性系统的近似解决方案$ \ bar {\ mathbf {a}}} \ theta = \ bar {\ mathbf {b}} $ a}},\ bar {\ mathbf {b}})$只能通过(渐近)无偏见的观察来估算$ \ {(\ m athbf {a}(z_n),\ mathbf {b} {n \ in \ mathbb {n}} $。我们在这里考虑$ \ {z_n \} _ {n \ in \ mathbb {n}} $是i.i.d.序列或统一的几何千古马尔可夫链,并得出了$ p $ - 大小写的不等式和高概率界限,用于LSA及其polyak-ruppert平均版本定义的迭代。更确切地说,我们建立订单$(p \ alpha t _ {\ pereratatorName {mix}}}))^{1/2} d^{1/p} $在$ p $ - LSA的最后一个迭代的$ p $ - 。在此公式中,$ \ alpha $是该过程的步骤大小,$ t _ {\ operatatorName {mix}} $是基础链的混合时间($ t _ {\ operatotorname {mix {mix}} = 1 $ in I.I.D.设置中的1 $ )。然后,我们证明了迭代的polyak-ruppert平均序列上的有限时间实例依赖性边界。这些结果是明确的,从某种意义上说,我们获得的领先术语匹配局部渐近minimax限制,包括对参数$(d,t _ {\ operatorname {mix}})$的紧密依赖性在更高的术语中。
translated by 谷歌翻译
本文研究了用于训练过度参数化制度中的贝叶斯神经网络(BNN)的变异推理(VI),即当神经元的数量趋于无穷大时。更具体地说,我们考虑过度参数化的两层BNN,并指出平均VI训练中的关键问题。这个问题来自于证据(ELBO)的下限分解为两个术语:一个与模型的可能性函数相对应,第二个对应于kullback-leibler(KL)差异(KL)差异。特别是,我们从理论和经验上都表明,只有当根据观测值和神经元之间的比率适当地重新缩放KL时,在过度参数化制度中,这两个术语之间存在权衡。我们还通过数值实验来说明我们的理论结果,这些实验突出了该比率的关键选择。
translated by 谷歌翻译